pRactice corner: Exploratory Data Analysis

lruolin

R4DS Practice 05: Exploratory Data Analysis

The codes below are from the practice exercises in https://r4ds.had.co.nz/, and are taken with reference from: https://jrnold.github.io/r4ds-exercise-solutions/

Let’s begin now

Loading tidyverse package. The main packages used are dplyr and ggplot2.

library(tidyverse)

EDA

EDA is carried out to generate questions about your data, to search for answers by visualizing, transforming and modelling your data, and to use the insights to further generate new questions.

Visualizing variation

Visualizing distributions: Categorical - Use Bar Chart

count_cut <- diamonds %>% 
  count(cut) %>% 
  arrange(desc(n))

count_cut

# A tibble: 5 x 2
  cut           n
  <ord>     <int>
1 Ideal     21551
2 Premium   13791
3 Very Good 12082
4 Good       4906
5 Fair       1610

diamonds %>% 
  ggplot(aes(x = cut)) +
  geom_bar(fill = "coral")  +
  theme_classic()

Visualizing distributions: Numerical - Histogram

diamonds %>% 
  count(cut_width(carat, 0.5))  # cut_width() makes groups of width width.

# A tibble: 11 x 2
   `cut_width(carat, 0.5)`     n
   <fct>                   <int>
 1 [-0.25,0.25]              785
 2 (0.25,0.75]             29498
 3 (0.75,1.25]             15977
 4 (1.25,1.75]              5313
 5 (1.75,2.25]              2002
 6 (2.25,2.75]               322
 7 (2.75,3.25]                32
 8 (3.25,3.75]                 5
 9 (3.75,4.25]                 4
10 (4.25,4.75]                 1
11 (4.75,5.25]                 1

diamonds %>% 
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.5, fill = "deepskyblue4", col = "black") +
  labs(title = "Histogram of carat size distribution in diamonds dataset",
       subtitle = "Most of the diamonds have a carat value between 0.25 and 0.75") +
  scale_x_continuous(n.breaks =  20) +
  theme_classic()

A histogram divides the x-axis into equally spaced bins and then uses the height of each bar to display the number of observations that fall in each bin.

To overlay multiple histograms:

diamonds %>% 
  ggplot(aes(x = carat, color = cut)) +
  geom_freqpoly(binwidth = 0.1) +
  theme_classic()

Typical Values

diamonds %>% 
  filter(carat<3) %>% 
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.01, col = "coral") +
  scale_x_continuous(n.breaks = 20) +
  labs(title = "Histogram of diamond sizes < 3 carats.",
        subtitle = "Most diamonds are smaller than 1 carat") +
  theme_classic()

Unusual Values

diamonds %>% 
  ggplot(aes(x = y)) +
  geom_histogram(aes(x = y), binwidth = 0.5, fill = "deepskyblue3") +
  labs(title = "Histogram for Diamond Carat Sizes",
       subtitle = "Using coord_cartesian() to reveal outlier values that might otherwise not be obvious.") +
  scale_x_continuous(n.breaks = 20) +
  coord_cartesian(ylim = c(0, 20)) +
  theme_classic()

Exercise

Explore the distribution of each of the x, y, z variables in diamonds dataset.

glimpse(diamonds)

Rows: 53,940
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22…
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very…
$ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, …
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1…
$ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, …
$ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 33…
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87…
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78…
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49…

diamonds %>% 
  select(x,y,z) %>% 
  summary() # summary statistics

       x                y                z         
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
 Median : 5.700   Median : 5.710   Median : 3.530  
 Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
 3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
 Max.   :10.740   Max.   :58.900   Max.   :31.800

# geom_histogram

hist_x <- diamonds %>% 
  select(x) %>% 
  ggplot(aes(x = x)) +
  geom_histogram(binwidth = 0.01, fill = "deepskyblue4") +
  labs(title = "Histogram for x",
       subtitle = "x has a multimodal distribution and is of the range of 0 - 10.74") +
  scale_x_continuous(n.breaks = 30) +
  theme_classic()


hist_y <- diamonds %>% 
  select(y) %>% 
  ggplot(aes(x = y)) +
  geom_histogram(binwidth = 0.01, fill = "deepskyblue4") +
  labs(title = "Histogram for y",
       subtitle = "y has a multimodal distribution and is of the range of 0 - 58.90. There seem to be outliers in this variable.") +
  scale_x_continuous(n.breaks = 30) +
  theme_classic()

(hist_z <- diamonds %>% 
  select(z) %>% 
  ggplot(aes(x = z)) +
  geom_histogram(binwidth = 0.01, fill = "deepskyblue4") +
  labs(title = "Histogram for z",
       subtitle = "z has a multimodal distribution and is of the range of 0 - 31.8 There seem to be outliers in this variable.") +
  scale_x_continuous(n.breaks = 30) +
  theme_classic())

gridExtra::grid.arrange(hist_x, hist_y, hist_z, nrow = 3)

x is likely to be length, y is likely to be breadth and z is likely to be depth.

diamonds %>% 
  ggplot(aes(x, y)) +
  geom_point() +
  labs(title = "Scatterplot of x vs y") +
  theme_classic()

diamonds %>% 
  ggplot(aes(x, z)) +
  geom_point() +
  labs(title = "Scatterplot of x vs z") +
  theme_classic()

The plots above make it easier to detect outliers in relation to other variables.

Explore the distribution of price

diamonds %>% 
  select(price) %>% 
  summary()

     price      
 Min.   :  326  
 1st Qu.:  950  
 Median : 2401  
 Mean   : 3933  
 3rd Qu.: 5324  
 Max.   :18823

diamonds %>% 
  select(price) %>% 
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 5)

diamonds %>% 
  select(price) %>% 
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 5) +
  labs(title = "Histogram for price distribution of diamonds",
       subtitle = "There is a strange gap between $1000 and $2000") +
  coord_cartesian(xlim = c(0, 3000)) +
  theme_classic()

How many diamonds are 0.99 carat? How many are 1 carat?

diamonds %>% 
  filter(carat == 0.99) %>% 
  count() # 23

# A tibble: 1 x 1
      n
  <int>
1    23

diamonds %>% 
  filter(carat == 1) %>% 
  count()

# A tibble: 1 x 1
      n
  <int>
1  1558

# This could be due to the price differences.

Missing Values

It is recommended to replace outlier values with missing values using mutate() and ifelse().

diamonds_b <- diamonds %>% 
  mutate(y_edited = ifelse(y<3 | y >20, NA, y))

diamonds_b %>% 
  ggplot(aes(x, y_edited)) +
  geom_point(na.rm = T)

Visualizing Covariation (behaviour between variables)

Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is the visualize the relationship between two or more variables.

Categorical and Continuous Variable: Boxplot

diamonds %>% 
  ggplot(aes(x = fct_rev(cut), y = price)) +
  geom_boxplot() +
  labs(title = "How price varies for different cuts",
       subtitle = "Ideal cuts cost lower than fair cuts",
       x = "Cut",
       y = "Price") +
  theme_classic() +
  coord_flip()

Exercise

Improve the visualization for departure times of cancelled vs non-cancelled flights

library(nycflights13)

glimpse(flights)

Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, …
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2,…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, …
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EW…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FL…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, …
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 20…

flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min/60
  ) %>% 
  ggplot(aes(x = cancelled, y = sched_dep_time)) +
  geom_boxplot() +
  labs(title = "Departure times of cancelled vs non-cancelled flights") +
  theme_classic()

Which variable in the diamonds dataset is the most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

glimpse(diamonds)

Rows: 53,940
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22…
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very…
$ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, …
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1…
$ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, …
$ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 33…
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87…
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78…
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49…

# To investigate the relationship between price and carat

diamonds %>% 
  select(carat, price) %>% 
  ggplot(aes(carat, price)) +
  labs(title = "Relationship between price and carat of diamonds",
       subtitle = "The larger the diamond, the higher the price of diamonds") +
  geom_point() +
  theme_classic()

# To investigate the relationship between cut and price

diamonds %>% 
  select(cut, price) %>% 
  ggplot(aes(x = cut, y = price)) +
  geom_boxplot(notch = T) +
  labs(title = "Relationship between cut and price of diamonds",
       subtitle = "There is a weak negative relationship between cut and price.") +
  theme_classic()

# Relationship between color and price

diamonds %>% 
  select(color, price) %>% 
  ggplot(aes(x = fct_rev(color), y = price)) +
  geom_boxplot(notch = T) +
  labs(title = "Relationship between color and price of diamonds",
       subtitle = "There is a weak negative relationship between cut and price.",
       caption = "The scale of color goes from D (best) to J (worst)",
       x = "Color",
       y = "Price") +
  theme_classic()

# Relationship between price and clarity

diamonds %>% 
  select(clarity, price) %>% 
  ggplot(aes(x = clarity, y = price)) +
  geom_boxplot(notch = T) +
  labs(title = "Relationship between clarity and price of diamonds",
       subtitle = "There is a weak negative relationship between clarity and price.",
       caption = "The scale of clarity goes from I1 (worst) to IF (best)",
       x = "Clarity",
       y = "Price") +
  theme_classic()

Carat is the best predictor of diamond prices.

#Comparing carat and cut

diamonds %>% 
  select(carat, cut) %>% 
  ggplot(aes(carat, cut)) +
  geom_boxplot() +
  labs(title = "Relationship between cut and carat size",
       subtitle = "Bigger diamonds may not be of the best cut.") +
  theme_classic()

One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlier values”. One approach to remedy this problem is the letter value plot.

# install.packages("lvplot")
library(lvplot)

diamonds %>% 
  ggplot(aes(cut, price)) +
  geom_lv(col = "deepskyblue4") +
  theme_classic()

The letter value plots give more insights on how the data is distributed beyond the quantiles. Larger boxes mean that a larger proportion of the data falls within the boxed area.

Compare and contrast geom_violin() with a faceted geom_histogram() or a colored geom_freqpoly().

diamonds %>% 
  ggplot(aes(price)) +
  geom_histogram(col = "black", fill = "deepskyblue4") +
  labs(title = "Histogram showing price distribution of different cuts of diamonds",
       subtitle = "Histograms are good for comparing the overall shape of distributions across categories") +
  facet_wrap( ~cut, ncol = 1, scales = "free_y") +
  theme_classic()

diamonds %>% 
  ggplot(aes(cut, price)) +
  geom_violin(fill = "orange") +
  labs(title = "Violin plot showing distribution of prices for different cuts of diamonds",
       subtitle = "Violin plot shows the full distribution of data, and are useful if the distribution is multimodal.") +
  theme_classic() +
  coord_flip()

Two Categorical Variables

To visualize the covariation between categorical variables, dplyr can be used to compute the count, and then visualize using geom_tile()

diamonds %>% 
  count(color, cut)

# A tibble: 35 x 3
   color cut           n
   <ord> <ord>     <int>
 1 D     Fair        163
 2 D     Good        662
 3 D     Very Good  1513
 4 D     Premium    1603
 5 D     Ideal      2834
 6 E     Fair        224
 7 E     Good        933
 8 E     Very Good  2400
 9 E     Premium    2337
10 E     Ideal      3903
# … with 25 more rows

# visualize

diamonds %>% 
  count(color, cut) %>%  
  ggplot(aes(x = color, y = cut)) + # use the variable with longer labels on y axis
  geom_tile(aes(fill = n)) +
  scale_fill_gradient(low = "white", high = "deepskyblue4") # manual

To show the distribution of cut within color, the proportion of each cut within a color can be calculated.

diamonds %>% 
  count(color, cut) %>% 
  group_by(cut) %>% 
  mutate(proportion = n/sum(n)) %>% 
  ggplot(aes(color, cut)) +
  geom_tile(aes(fill = proportion)) +
  scale_fill_gradient(low = "white", high = "deepskyblue4") # manual

Two continuous variables

A scatter plot may be used to visualize the co-variation between two variables.

If the dataset is too big, one can bin one continuous variable such that it acts like a categorical variable, and then use boxplot or frequency polygon to visualize.